Abstract. Evaluating link prediction methods in very large complex networks is a hard
task because of the prohibitive computational cost. By setting a lower bound on
the number of common neighbors (CN), we propose a new framework to efficiently and
precisely evaluate the performance of CN-based similarity indices in link prediction
for very large heterogeneous networks. Specifically, we propose a fast algorithm based
on a parallel computing scheme to obtain all the node pairs with CN values larger
than the lower bound. Furthermore, we propose a new measurement, called
self-predictability, to quantify the performance of CN-based similarity indices in link
prediction, which in turn indicates the link predictability of a network.

1. Introduction

The scale and complexity of real-world systems such as social systems are growing at an
unprecedented rate, which makes predicting such systems more and more challenging
and attracts increasing attention from both industry and research
[1, 2, 3, 4, 5, 6]. The link prediction problem is to quantify the likelihood of
as yet unknown associations between individuals in networks. Generally, there are two
kinds of link prediction problems [7, 8]. The first is predicting future links
given the current topology of a network. The second is inferring the missing
links that are likely to exist in a static snapshot of a network. Link prediction has
wide applications in social systems [9, 10], biological systems [11, 12], scientific systems
[13, 14], etc. For example, recommender systems [15, 16] could suggest either
people (items) that you might find interesting, or friends (items) that you
already know but are not yet connected to online. In national security applications, link
prediction can help to identify hidden terrorist groups or criminal relations [17, 18].
In bioinformatics, link prediction is employed to find interactions between proteins
[19] or the side effects of drugs [20]. In scientific research, link prediction can be used
to find experts [21] or potential future coauthors [22, 23] based on current
co-authorship networks. In all these applications, link prediction facilitates the evolution
or enhances the completeness of the related systems.

Link prediction methods may be broadly divided into three groups [8]: maximum
likelihood algorithms, probabilistic models, and similarity-based strategies. Compared
with the first two types of approaches, similarity-based strategies seem more promising
because of their simplicity and relatively low computational cost. In
similarity-based methods, unconnected node pairs are assigned similarity scores, and the node
pairs with high similarity scores are assumed to be linked with high probability.
However, it is hard to define node similarity indices, partly because node attributes are
not easy to obtain. Thus, many researchers aim to propose similarity indices
that rely only on the network structure. These indices can be further categorized into
three types according to the amount of information used in the similarity computation:
local indices [24, 25, 26, 27], global indices [28, 29, 30, 31], and quasi-local indices
[32, 33, 34]. Local indices require only the local structure around nodes to
determine their similarity. The first and most widely studied local index is the
common neighbors (CN) index [7], which quantifies the similarity of a pair of nodes
as the number of neighbors they have in common. Many other local indices are
extensions of the CN index [8].
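
As a minimal illustration of the CN index defined above, the following Python sketch scores every node pair of a small graph by the size of the intersection of the two neighbor sets. The toy graph `adj` and the function name `common_neighbors` are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

def common_neighbors(adj, u, v):
    """Return the CN index of (u, v): the number of shared neighbors."""
    return len(adj[u] & adj[v])

# Hypothetical toy graph, given as node -> set of neighbors.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

# Score every unordered node pair, connected or not.
cn_scores = {(u, v): common_neighbors(adj, u, v)
             for u, v in combinations(sorted(adj), 2)}
```

This brute-force loop visits all |V|(|V| - 1)/2 pairs, which is exactly the cost that becomes a problem for large networks.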

Although local indices are promising in link prediction, they are hard to
evaluate in large real-world networks. First, the current evaluation framework requires
calculating the similarities of a large number of node pairs, with a worst-case time
complexity of O(|V|^2), where V is the node set [8]. Even if we use a multi-core
cluster in the computation, the cost decreases at most linearly. Second,
the current metrics for evaluating the performance of similarity indices, such as
precision [35] and AUC [36], have limitations when applied to large-scale networks.
The number of edges in the complement graph of a large real-world network is usually
much larger than that in the original network, which makes the value of precision tend
to zero and leads to a large variance of AUC. Recently, several fast link-prediction
algorithms [31, 37] based on the MapReduce computational model [38] were proposed in
the literature. These algorithms are demonstrated to work efficiently in the calculation
of CN-based similarity indices, since they divide the computation into a
map phase and a reduce phase, so that the computation can be parallelized across clusters
of many machines. However, these fast algorithms do not essentially reduce the time
complexity, and they are still not efficient for large dense networks with millions of nodes.

In fact, most of the real-world networks have heterogeneous topological structures, 
and link recommendation usually happens in the relatively dense areas of the networks. 
For example, we are more interested in individuals who have a large number of friends,
and we would like to recommend these “hot” individuals to the others. Based on
these facts, we propose a bounded link prediction framework, in which we set a lower 
bound of CN values. With the lower bound, we propose a fast parallel algorithm for 
calculating CN values based on the MapReduce model. Also, we propose a new metric 
for evaluating the performance of the CN-based indices. With our fast algorithm and 
the new measurement, we can efficiently and precisely evaluate the CN-based indices in 
link prediction for very large real-world networks. 

2. Traditional algorithms based on MapReduce 

In this section, we discuss the traditional MapReduce-based algorithms [31, 37]
for calculating the CN values, and their limitations from the perspective of
computational complexity.

MapReduce [38] is a widely used programming model for dealing with searching,
sorting, and many other tasks related to large-scale datasets. Programmers find the
MapReduce system easy to use in that it automatically parallelizes the tasks across
large-scale clusters of machines, handles machine failures and schedules inter-machine
communications. The user only needs to specify the computation task in terms of a map
function and a reduce function. The map function takes an input key/value pair and produces
a set of intermediate key/value pairs. Then, all the intermediate pairs are grouped by
key and passed to the reduce function. In the reduce function, the values for a key
are merged together to form a smaller set of values. For the traditional pair generating
algorithm [31], node indices and node adjacencies are specified as the key/value inputs
of the map function. In the “map” phase, the neighbors of a node are paired with each
other, and each node pair is taken as a key and assigned the value (score) “1”. In the
“reduce” phase, the values for each key are summed, which gives the desired CN values.
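
The pair generating scheme described above can be sketched in plain Python as follows. This is a single-process simulation under stated assumptions: `map_pairs` and `reduce_pairs` are hypothetical names that only mimic the map and reduce phases, and the toy graph is invented for illustration.

```python
from collections import Counter
from itertools import combinations

def map_pairs(node, neighbors):
    """Map phase: every pair of neighbors of `node` shares `node`."""
    for x, y in combinations(sorted(neighbors), 2):
        yield (x, y), 1

def reduce_pairs(emitted):
    """Reduce phase: sum the 1s emitted for each node pair."""
    scores = Counter()
    for pair, one in emitted:
        scores[pair] += one
    return dict(scores)

# Hypothetical toy graph, node -> set of neighbors.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}

# Every emitted key/value pair is one message from map to reduce.
emitted = [kv for node, nbrs in adj.items() for kv in map_pairs(node, nbrs)]
cn = reduce_pairs(emitted)
```

Here `len(emitted)` counts the intermediate key/value pairs, so it is a direct proxy for the transmission cost between the two phases.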

This algorithm can be further improved by means of vectorization [37]. In the
vectorization algorithm, the value of a key is set to be an accompanied group (see Fig.
1 for an illustration of accompanied groups). Thus, the number of data transmissions
from the map function to the reduce function is greatly reduced. For example, assume
there are several node pairs incident with node v7, namely (v1, v7), (v2, v7), (v4, v7),
and (v5, v7). In the pair generating algorithm, these node pairs are all different keys
and are sent to the reduce function one by one. In the vectorization
algorithm, however, these node pairs are transformed into a single key/value pair, where the
key is node v7 and the value is v7's accompanied group {v1, v2, v4, v5}, and then this
intermediate key/value pair is emitted to the reduce function. The pseudocodes for
the pair generating algorithm and the vectorization algorithm are in Appendix A.
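
A minimal Python sketch of the vectorization idea follows, assuming sorted adjacency lists so that the accompanied group of an element is the list of elements before it. The names `map_groups` and `reduce_groups` and the toy graph are illustrative assumptions.

```python
from collections import Counter, defaultdict

def map_groups(neighbors):
    """Map phase: emit each neighbor together with its accompanied group,
    i.e. the neighbors that precede it in the sorted adjacency list."""
    ordered = sorted(neighbors)
    for i in range(1, len(ordered)):
        yield ordered[i], ordered[:i]

def reduce_groups(node, groups):
    """Reduce phase: each member of a group scores 1 for the pair (member, node)."""
    counts = Counter()
    for group in groups:
        for other in group:
            counts[(other, node)] += 1
    return counts

# Hypothetical toy graph, node -> set of neighbors.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}

grouped = defaultdict(list)
messages = 0  # number of key/value pairs sent from map to reduce
for nbrs in adj.values():
    for key, group in map_groups(nbrs):
        grouped[key].append(group)
        messages += 1

cn = Counter()
for node, groups in grouped.items():
    cn.update(reduce_groups(node, groups))
```

On this toy graph the sketch sends 6 messages, while emitting every node pair separately would take 8. The number of score updates is the same in both schemes.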

We present a small example to further illustrate the above two algorithms. As 
shown in Fig. 1, the given network contains 8 nodes and 15 edges. First, we get 
the node adjacencies based on the given network. Then, with the node adjacencies as 
the inputs, we get all the non-zero CN values for the given network by using the pair 
generating algorithm and the vectorization algorithm respectively. 

Compared to non-parallel algorithms, the MapReduce-based algorithms
have been shown in tests to be very efficient in the calculation of CN-based similarity
indices. The reason is that only the node adjacency information is needed in the
calculation of CN values, and thus these MapReduce-based algorithms can easily parallelize the
computation across clusters of machines. Assume a graph G(V, E), where V is the node
set and E is the edge set. The average degree is ⟨k⟩ = 2|E|/|V|. In the best case, all
the nodes have the same degree ⟨k⟩. Then, the number of node pairs generated
from the adjacency of a node is C(⟨k⟩, 2) = ⟨k⟩(⟨k⟩ - 1)/2, and the total number of
generated node pairs is ⟨k⟩(⟨k⟩ - 1)|V|/2. Thus, the time complexity in the best case
is O(|E| * ⟨k⟩). In a sparse network, if all the node degrees are less than a
constant value, the time complexity is close to O(|V|). The time complexity increases
with the heterogeneity of the degree distribution. This can be illustrated by the following
inequality:

\[ C(\langle k\rangle, 2) + C(\langle k\rangle, 2) < C(\langle k\rangle + \Delta, 2) + C(\langle k\rangle - \Delta, 2), \quad \Delta > 0, \tag{1} \]

which always holds. In a parallel computing environment, the time complexity
decreases by a constant factor depending on the number of machines. Compared to the
pair generating algorithm, the vectorization algorithm only reduces the number of
transmissions from O(|V|^3) to O(|V|^2) (in the worst case), while the total numbers of generated
node pairs for the two algorithms are the same. Thus, the time complexity of the
vectorization algorithm is the same as that of the pair generating algorithm, and it
is faster than the pair generating algorithm only by a constant factor. Based on the
above analysis, we conclude that the large time complexity of the CN-based link prediction
methods results from the generation of a large number of node pairs. In the worst case, if
every node pair has common neighbors, the lower bound of the time complexity
for any algorithm is Ω(|V|^2).
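
Inequality (1) can be checked directly from the definition C(n, 2) = n(n - 1)/2. The short derivation below is added for illustration and uses only the symbols ⟨k⟩ and Δ from the text:

```latex
\begin{align*}
C(\langle k\rangle + \Delta, 2) + C(\langle k\rangle - \Delta, 2)
  &= \frac{(\langle k\rangle + \Delta)(\langle k\rangle + \Delta - 1)
         + (\langle k\rangle - \Delta)(\langle k\rangle - \Delta - 1)}{2} \\
  &= \langle k\rangle(\langle k\rangle - 1) + \Delta^{2} \\
  &= 2\,C(\langle k\rangle, 2) + \Delta^{2}.
\end{align*}
```

Moving Δ units of degree from one node to another therefore adds Δ^2 generated node pairs, which is why a more heterogeneous degree sequence with the same average degree costs more.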
Figure 1. Illustration of the procedure for calculating the CN values with the pair
generating algorithm and the vectorization algorithm. (a) is the given network. (b)
shows the node adjacencies of the network. For instance, “Adj(v0)” represents the
adjacencies of node v0, which is “{v1, v2, v4, v5, v7}”. (c) and (d) are the outputs of
the map function for the pair generating algorithm and the vectorization algorithm
respectively. For instance, in the first line of (c), “(v1, v2)(v1, v4)(v1, v5)(v1, v7)”
represents the node pairs incident with node v1, which are generated based on Adj(v0).
In the first line of (d), “Aco(v7)” represents the accompanied group of node v7, which is
{v1, v2, v4, v5}. Assuming the adjacencies of a node are {s0, s1, s2, ...}, the accompanied
group of s_i is {s0, s1, ..., s_(i-1)}.

3. Our algorithm

In this part, we first introduce the idea of our algorithm, which originates from
the scale-free property of real-world networks. Then, we present two lemmas which are
related to the filtering operation in our algorithm. Finally, we introduce the four steps
of our fast algorithm.

3.1. The idea 

Most real-world networks have heterogeneous topological structures and exhibit the
scale-free property [39]. In a scale-free network, a relatively small fraction of nodes
called hubs have a large number of neighbors, while most of the other nodes have only a
small number of neighbors. The node degree distribution of a scale-free network obeys
a power law. The imbalance of node degrees becomes more serious with the evolution
of the network according to the preferential attachment rule [40]. The heterogeneity
of the topological structures of many real-world networks is further magnified in the
distributions of the CN values. Fig. 2 presents the simulation results of the CN
distributions for six large-scale real-world networks. Clearly, for all
six networks most of the node pairs have only small CN values, while a relatively small
number of node pairs have large CN values. Generally, nodes incident with the node
pairs of small CN values may have small degrees, and these small-degree nodes form
the sparse area of a network. However, nodes incident with the node pairs of large CN
values must have large node degrees, and they constitute the dense area of a network.
Usually, the dense area better embodies the organization rule of a network than the
sparse area. On the other hand, in real situations we are more likely to recommend hot
individuals to the others, and the probability of interaction between hot individuals is
much larger than that between inactive individuals (which have only a few connections
with others). Based on these facts, we introduce a lower bound L of CN values in link
prediction to filter out in advance the node pairs which have a very small chance
of being connected by edges. Since in most real-world networks a large fraction of node
pairs have small CN values, plenty of node pairs with CN values no greater than L will be
filtered out, which greatly reduces the computational cost.

3.2. Lemmas 

In our algorithm, we filter out the node pairs with CN values no greater than L in advance
based on two lemmas, which are as follows:

Lemma 1. In a network, if the number of neighbors of a node is no greater than
L, we can simply ignore the node pairs that contain this node, since these node pairs
cannot have more than L common neighbors.

Figure 2. Log-log plot of CN distributions for six large-scale real-world networks.
The CN values of all the node pairs, either connected by edges or not, are computed.
The statistics of these networks are presented in Table 2.

In the implementation, we filter out the node adjacencies of those nodes which have
no more than L neighbors. Note that we cannot remove those nodes from the other
node adjacencies, since those nodes may be the common neighbors of other
node pairs.
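
The Lemma 1 filter can be sketched in a few lines of Python. The helper name `filter_adjacencies` and the toy graph are assumptions made for illustration. The sketch drops whole adjacency lists but leaves the low-degree nodes inside the surviving lists, as the note above requires.

```python
def filter_adjacencies(adj, L):
    """Keep only the adjacency lists with more than L entries (Lemma 1).

    Low-degree nodes are NOT removed from the surviving lists, because
    they may still be common neighbors of other node pairs.
    """
    return {u: nbrs for u, nbrs in adj.items() if len(nbrs) > L}

# Hypothetical toy graph, node -> set of neighbors.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
kept = filter_adjacencies(adj, L=2)
```

With L = 2 only the lists of nodes 0 and 2 survive, yet nodes 1 and 3 still appear inside them.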

Lemma 2. In the remaining network (after filtering the original network based on
Lemma 1), if a node appears in at most L node adjacencies, this node cannot be in
the desired node pairs. Lemma 2 is the inverse formulation of Lemma 1.

Proof of Lemma 2. Let us assume that a node u is in the node adjacencies. Then
u has at least one accompanied group. (Based on our definition of the accompanied
group, the first node in a node adjacency has no accompanied group. However, we
can simply modify the order of the nodes in the node adjacency to ensure that u has
an accompanied group.) For instance, in Fig. 1(d) node v7 has 3 accompanied groups
marked with Aco(v7). Also, within one accompanied group of node u every node is unique,
while across several accompanied groups of node u there may be overlapping nodes. For
instance, in Fig. 1(d) node v1 appears in two accompanied groups of node v5. For
an accompanied group of node u, every element i will be used to generate a node pair
(i, u) with score 1. Thus, the total score (or CN value) of node pair (i, u) is
no greater than the number of accompanied groups that node u has. Therefore, if
node u appears in at most L node adjacencies (which means u has no more than L
accompanied groups), the CN value of any node pair (*, u) will be no greater than L.
Then, these node pairs (*, u) can be filtered out in advance.
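
The bound used in the proof can be checked numerically. The sketch below is illustrative only: `accompanied_groups` is a hypothetical helper that builds accompanied groups from sorted adjacency lists, and the toy graph is invented. It asserts that the CN value of every pair (i, u) counted through u's groups never exceeds the number of accompanied groups of u.

```python
from collections import defaultdict

def accompanied_groups(adj):
    """Map each node to the list of its accompanied groups.

    For a sorted adjacency list, the accompanied group of an element is
    the list of elements that precede it.
    """
    groups = defaultdict(list)
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        for rank in range(1, len(ordered)):
            groups[ordered[rank]].append(ordered[:rank])
    return groups

# Hypothetical toy graph, node -> set of neighbors.
adj = {0: {2, 3, 4}, 1: {2, 3, 4}, 2: {0, 1, 5},
       3: {0, 1}, 4: {0, 1}, 5: {2}}

groups = accompanied_groups(adj)
for u, gs in groups.items():
    members = {i for g in gs for i in g}
    for i in members:
        cn = len(adj[i] & adj[u])
        # Lemma 2: CN(i, u) cannot exceed the number of groups of u.
        assert cn <= len(gs)
```

In this toy graph node 5 has a single accompanied group, so for any L >= 1 all pairs (*, 5) can be discarded without counting them.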
Figure 3. Illustration of the first step of our algorithm. The given network is the
same as in Fig. 1(a). The lower bound of CN is L = 3. We see that node adjacencies
Adj(v5), Adj(v6), and Adj(v7) highlighted in bold are filtered out. The residual node
adjacencies (b) can be visualized by a directed network (c).


3.3. Steps of our algorithm 

Our fast algorithm is an improved version of the traditional MapReduce-based
algorithms [31, 37] in that it filters out the irrelevant node pairs with the above two
lemmas, and focuses on the node pairs with CN > L under the MapReduce scheme.
Specifically, our fast algorithm can be divided into the following four steps.

Step 1: we filter the original network based on Lemma 1. Only the node
adjacencies of size larger than L are kept. The filtering of the node adjacencies
turns the original undirected network into a directed network, as shown in Fig. 3.

Step 2: we reverse the directions of all the edges in the directed network, and
generate the new node adjacencies based on the modified directed network, as shown in
Fig. 4.

Step 3: based on the new node adjacencies from Step 2, we first generate all the
accompanied groups. In a shared-memory environment, we can just emit the addresses
and the sizes of accompanied groups instead of their contents
in order to further reduce the amount of data transmission. In that case, the intermediate
key/value pair is set to be “(c, <Adj(a), h>)”, where c is a node in Adj(a), and h is
the rank of c in Adj(a), which is equal to the size of Aco(c). Then, we filter out the nodes
(and their related accompanied groups) which have no more than L accompanied groups
according to Lemma 2, as shown in Fig. 5.

Step 4: based on the residual accompanied groups from Step 3, we finally obtain
the desired key/value pairs, where the key is the node pair and the value is the CN value.
Note that although we execute the filter operations in Step 1 and Step 3,
there might still be some generated node pairs with CN values no greater than L, which
are removed by a final check, as shown in Fig. 6.
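
The four steps above can be sketched as one single-process Python function. This is a hedged illustration under stated assumptions, not the authors' implementation: the function name `bounded_common_neighbors` and the toy graph are invented, sorted adjacency lists define the accompanied groups, and a (list, size) tuple stands in for the "address and size" representation.

```python
from collections import Counter, defaultdict

def bounded_common_neighbors(adj, L):
    """Return {(u, v): CN} for all node pairs u < v with CN > L."""
    # Step 1 (Lemma 1): keep adjacency lists with more than L neighbors.
    kept = {u: nbrs for u, nbrs in adj.items() if len(nbrs) > L}

    # Step 2: reverse the residual directed edges u -> v.
    reversed_adj = defaultdict(list)
    for u, nbrs in kept.items():
        for v in nbrs:
            reversed_adj[v].append(u)

    # Step 3: record each accompanied group as a (list, size) reference.
    groups = defaultdict(list)
    for members in reversed_adj.values():
        members.sort()
        for rank in range(1, len(members)):
            groups[members[rank]].append((members, rank))
    # Lemma 2: drop nodes that have at most L accompanied groups.
    groups = {x: refs for x, refs in groups.items() if len(refs) > L}

    # Step 4: count co-occurrences and keep only CN values above L.
    result = {}
    for x, refs in groups.items():
        counts = Counter()
        for members, size in refs:
            counts.update(members[:size])
        for other, cn in counts.items():
            if cn > L:
                result[(other, x)] = cn
    return result

# Hypothetical toy graph, node -> set of neighbors.
adj = {0: {2, 3, 4}, 1: {2, 3, 4}, 2: {0, 1, 5},
       3: {0, 1}, 4: {0, 1}, 5: {2}}
pairs = bounded_common_neighbors(adj, L=2)
```

For this graph and L = 2, only nodes 0, 1 and 2 survive Step 1, and the only pair reported is (0, 1) with three common neighbors. The final `cn > L` check mirrors the last filter of Step 4.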









Figure 4. Illustration of the second step of our algorithm. (a) is the new directed
network generated by reversing the directions of all the edges of the directed network
in Fig. 3(c). (b) shows the corresponding node adjacencies of (a). Note that Adj(v5),
Adj(v6), and Adj(v7), which were filtered out in Step 1, appear again.



Figure 5. Illustration of the third step of our algorithm. (a) presents the node
adjacencies generated in Step 2. (b) shows the accompanied groups generated based
on (a). (c) gives the “address and size” representation of accompanied groups. For
example, Aco(v4): <Adj(v0), 2> equals Aco(v4): {v1, v2}. (d) is the ordered version
of (c). Aco(v1) and Aco(v2) highlighted in bold are filtered out, since both of their numbers
are 3, which is no greater than L (L = 3).



Figure 6. Illustration of the fourth step of our algorithm. (a) shows the residual
accompanied groups. (b) shows the final generated node pairs and their related CN
values. (c) presents the desired results. L = 3.

Figure 7. Execution time T vs lower bound L for large-scale real-world networks.
Multi-core processors are used in the computation. The number of cores increases from
1 to 16.


The first three steps of our algorithm are very fast. Most of the computational cost
lies in Step 4. However, after the two rounds of filtering, the input data of Step 4 is greatly
reduced. In addition, calculations based on the accompanied groups of different
nodes are independent, which makes them easy to parallelize. The efficiency of our
algorithm also depends on L. Clearly, we have 0 ≤ L < k_max, where k_max is the
maximum node degree in the network. A large L means that a lot of node pairs will
be filtered out, which makes our algorithm very fast. The pseudocode of our algorithm is
shown in Table 1.
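
Because the counting for different nodes is independent, Step 4 maps naturally onto a worker pool. The Python sketch below is only an illustration of that independence: `count_for_node`, `step4_parallel` and the input dictionary are hypothetical, and a thread pool is used for simplicity even though CPU-bound Python code would normally need processes or a cluster framework to gain real speed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_for_node(item, L):
    """Count co-occurrences in the accompanied groups of a single node."""
    x, groups = item
    counts = Counter()
    for group in groups:
        counts.update(group)
    return {(other, x): cn for other, cn in counts.items() if cn > L}

def step4_parallel(groups_by_node, L, workers=4):
    """Run Step 4 with one independent task per node."""
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tasks = pool.map(lambda item: count_for_node(item, L),
                         groups_by_node.items())
        for partial in tasks:
            result.update(partial)  # keys of different nodes never collide
    return result

# Hypothetical residual accompanied groups, node -> list of groups.
groups_by_node = {1: [[0], [0], [0]], 3: [[2], [2]], 4: [[2, 3], [2, 3]]}
pairs = step4_parallel(groups_by_node, L=1)
```

Each task writes pairs whose second element is its own node, so the partial results can be merged without any coordination.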

4. Implementation 

In our computing environment, there is a shared-memory machine with multi-core
processors running 64-bit Linux with 8 GB of memory. Each task is allowed to use
at most 2 GB of memory. The time complexity of each of the first three steps of our
algorithm is O(|E|). Thus, the first three steps can be executed serially or by employing
parallel computing schemes such as MapReduce or Resilient Distributed Datasets [41]
(RDD, see Appendix B). The fourth step of our algorithm has a time complexity of
O(|E| * ⟨k⟩), which is an order of magnitude larger than that of the first three steps.
On the other hand, in the fourth step the computation of CN values can be partitioned
into many separate tasks, which makes it easy to parallelize the computation in
our environment. We test the performance of our algorithm on six large-scale real-world
networks [42, 43, 44, 45], of which the statistics are shown in Table 2. We show the
execution time T (time for the computation of CN values) versus L in Fig. 7. We
see that T is small and decreases fast as L increases. For example, Gplus has
more than 13 million edges (Table 2), while its T is only several tens of seconds. When
multi-core processors are used in the computation, T decreases linearly to a few seconds,
as shown in Fig. 7(d).

Step 1

map(u, adj_list[u]) {
    if (adj_list[u].size > L)
        emit(u, adj_list[u])
    else
        // just remove vertex u and adj_list[u]
}

Step 2

map(u, adj_list[u]) {
    foreach v in adj_list[u]
        emit(v, u);
}

Step 3

map(u, adj_list[u]) {
    for i = 1 to adj_list[u].size() - 1
        emit(adj_list[u][i], <adj_list[u], i>);
}

reduce(x, aco_list[x]) {
    if (aco_list[x].size > L)
        emit(x, aco_list[x])
}

Step 4

map(x, aco_list[x]) {
    hs = hash_map()
    foreach <addr, len> in aco_list[x]
        for i = 0 to len - 1
            hs[(*addr)[i]]++
    foreach (key, value) in hs
        if (value > L)
            emit((key, x), value)
}

Table 1. The pseudocode of our fast algorithm.

Network              Nodes      Edges
CA-AstroPh [42]      18,772     198,110
Facebook [43]        4,039      88,234
Web-Google [44]      875,713    5,105,039
Gplus [43]           107,614    13,673,453
Higgs-twitter [45]   456,631    14,855,875
Web-Stanford [44]    281,903    2,312,497

Table 2. The statistics of six large-scale real-world networks.


5. Application 

In this part, we first discuss the limitations of two widely used metrics in link
prediction, namely precision and AUC. Then we propose a new metric for evaluating
link prediction in very large networks. We name this new metric self-predictability,
since it reflects the predictability of a network. Finally, we show some
simulation results on the self-predictability of several real-world networks.


5.1. Limitations of traditional metrics 


Link prediction aims to quantify the likelihood of the existence of a link between two
disconnected nodes. To evaluate the prediction ability of a similarity index, the original
edge set E is usually randomly divided into two parts: a training set E^T and a probe set
E^P. Clearly, E^T ∪ E^P = E and E^T ∩ E^P = ∅. Assume that a universal set U contains
all the |V|(|V| - 1)/2 possible links. Then, U - E is the set of nonexistent links. Each of
the links in U - E^T is given a similarity score based on the given similarity index. Precision is
defined as the fraction of links that belong to E^P among the top r links. AUC is defined
as the probability that a link randomly chosen from E^P has a larger similarity value than
a link randomly chosen from U - E. Usually, we conduct n independent comparisons.
Suppose that there are n' times that the link from E^P has a larger similarity value than
the link from U - E, and n'' times that they have the same score. Then AUC is calculated
as follows:

\[ \mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. \tag{2} \]
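
The sampling procedure behind Eq. (2) can be sketched in Python as follows. The function name `sampled_auc`, the fixed random seed and the tiny score table are assumptions for illustration; `score` can be any similarity index, for example the CN index.

```python
import random

def sampled_auc(score, probe_links, nonexistent_links, n, seed=0):
    """Estimate AUC from n independent comparisons, as in Eq. (2)."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        s_probe = score(rng.choice(probe_links))       # link from E^P
        s_none = score(rng.choice(nonexistent_links))  # link from U - E
        if s_probe > s_none:
            higher += 1   # contributes to n'
        elif s_probe == s_none:
            ties += 1     # contributes to n''
    return (higher + 0.5 * ties) / n

# Hypothetical scores: the probe link always outranks the nonexistent one.
scores = {("a", "b"): 2, ("c", "d"): 1}
auc = sampled_auc(scores.get, [("a", "b")], [("c", "d")], n=100)
```

An index that scores all links equally gives 0.5 under this estimator, which is the usual random-guess baseline.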


Figure 8. Two extreme cases for illustration. In (a), if any two nodes have common
neighbors, they are definitely connected by an edge. Thus, the self-predictability δ of
(a) equals 1. In (b), if any two nodes have common neighbors, they are not connected
by an edge. The self-predictability δ of (b) is 0.

For real-world networks, the average node degree ⟨k⟩ is usually far smaller than |V|.
Thus, when the network size |V| is large, |U - E| is much larger than |E|. Then, we
get:

\[ \frac{|E^P|}{|U - E^T|} \to 0, \quad \text{if } |V| \to \infty. \tag{3} \]

This equation indicates that when |V| is large enough, precision is very small
and close to zero. For AUC, if |U - E| is large, the CN values of links in U - E have
large variance. This means that in order to obtain an accurate AUC, the number of
comparisons n should be large enough, which requires a large computational cost. Therefore,
for very large real-world networks, it is inadequate to use precision or AUC to evaluate
the performance of similarity indices in link prediction.
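
To give a feeling for how small the ratio in Eq. (3) is, the sketch below evaluates it for two of the networks in Table 2. The 10% probe split and the function name `baseline_ratio` are assumptions made for this illustration, and the edge counts are used as undirected edge counts for simplicity.

```python
def baseline_ratio(n_nodes, n_edges, probe_fraction=0.1):
    """Return |E^P| / |U - E^T| for a random split of the edge set."""
    universe = n_nodes * (n_nodes - 1) // 2   # |U|
    probe = int(n_edges * probe_fraction)     # |E^P|
    train = n_edges - probe                   # |E^T|
    return probe / (universe - train)

# Node and edge counts copied from Table 2.
networks = {"Facebook": (4039, 88234), "Web-Google": (875713, 5105039)}
ratios = {name: baseline_ratio(v, e) for name, (v, e) in networks.items()}
```

The ratio is the fraction of candidate links that are true probe links, so it bounds what a random ranking can achieve and shrinks quickly as the network grows.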


5.2. Self-predictability 

We propose a new metric, namely self-predictability, to evaluate the performance of a
similarity index in link prediction. The self-predictability δ is defined as follows:

\[ \delta = \frac{|F(G, L) \cap G|}{|F(G, L)|}, \tag{4} \]

where G is the original graph, and L is the lower bound of CN values. F(G, L) is a
function of G and L that generates the node pairs with CN > L. The calculation of
self-predictability is much easier than that of precision and AUC in that it does not need to divide
the network into training and probe sets. Therefore, self-predictability requires lower
computational complexity than precision and AUC. Also, self-predictability reflects the
extent to which a network can be predicted by the CN index, and in turn
indicates the prediction precision of the CN index. δ = 1 means that if two nodes have
more than L common neighbors, they are definitely connected by a link, as shown in
Fig. 8(a). δ = 0 means that the network is totally unpredictable for the CN index.
As shown in Fig. 8(b), no node pair with CN > 0 is connected. Generally, δ is
between 0 and 1 for most real-world networks, and it depends on L. If δ
increases with L, a node pair with a large CN value has a large probability of being connected
by a link, which means the network can be precisely predicted by the CN index.
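
A direct, brute-force reading of Eq. (4) is sketched below in Python. The function name `self_predictability` and the two toy graphs are illustrative assumptions. The candidate set is built by checking every node pair, so this version is only meant for small graphs; on large networks the candidate pairs would come from the bounded algorithm instead.

```python
from itertools import combinations

def self_predictability(adj, L):
    """Fraction of node pairs with CN > L that are connected in the graph.

    Returns None when no node pair has more than L common neighbors.
    """
    candidates = connected = 0
    for u, v in combinations(sorted(adj), 2):
        if len(adj[u] & adj[v]) > L:      # pair belongs to F(G, L)
            candidates += 1
            if v in adj[u]:               # pair is also an edge of G
                connected += 1
    return connected / candidates if candidates else None

# Two hypothetical extremes: a complete graph and a complete bipartite graph.
complete = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
bipartite = {0: {2, 3, 4}, 1: {2, 3, 4}, 2: {0, 1}, 3: {0, 1}, 4: {0, 1}}

delta_complete = self_predictability(complete, L=0)
delta_bipartite = self_predictability(bipartite, L=0)
```

In the complete graph every pair with common neighbors is an edge, while in the bipartite graph no such pair is, which gives the two extreme values of δ.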


5.3. Simulation results 

We calculate the self-predictability of several large-scale real-world networks
[42, 43, 44, 45], of which the statistics are shown in Table 2. In the simulation, our fast
algorithm is employed to generate the node pairs with CN values greater than L. In Fig. 9,
we see that for most of the real-world networks, δ increases with L. This indicates that node
pairs with large CN values have a large probability of being connected by links, and these
node pairs form the dense area of a network, which embodies the intrinsic organization
rule of the network. However, there are fluctuations and exceptions in the curves. For
example, for Gplus, when L increases from 5000, δ even decreases with L. This indicates
that besides the “CN rule” there are some other factors affecting the organization and
evolution of a network.

Figure 9. Self-predictability δ vs lower bound L for large-scale real-world networks.

6. Conclusion and discussion 

Although CN-based indices for link prediction have attracted much attention in the 
past few years, their performances are hard to evaluate in large and dense real-world 
networks. In our framework, to efficiently and precisely predict the links, only node 
pairs of CN values greater than the lower bound are considered. This is mainly based 
on the fact that in real society hot individuals are more attractive and are more likely 
to be recommended than the others. Then, we present two lemmas, based on which we 
further propose a fast parallel algorithm to calculate the CN values. Our algorithm works
much more efficiently than the other related algorithms in that, through two filtering
operations, it excludes most of the node pairs with CN values no greater than the lower
bound before the CN calculation. Thus, our algorithm is especially applicable
to large-scale real-world networks, since these networks usually have heterogeneous CN
value distributions, where a large number of node pairs have only small CN values,
while a small fraction of node pairs have large CN values. Moreover, the efficiency of
our algorithm increases exponentially with the increase of the lower bound of CN values.

map(u, adj_list[u]) {
    foreach x in adj_list[u]
        foreach y in adj_list[u]
            if (x < y)    // to eliminate the repetition
                emit(relation = (x, y), promote_score = 1)
}

reduce(pairList) {
    for pair in pairList
        add one score to the pair
}

Table A1. The pair generating algorithm.


Then, we propose a new metric, self-predictability, for evaluating the
performance of a similarity index in link prediction. The calculation of self-predictability
does not need to divide the network into training and probe sets, and thus it requires a
lower computational cost than metrics such as precision and AUC.

We employ our fast algorithm to calculate the self-predictability of many large-scale
real-world networks. We find that self-predictability generally increases with the lower
bound of CN values, which indicates that two nodes with more common neighbors are
more likely to be connected by a link. On the other hand, we find that there are fluctuations
and exceptions in the simulation results of self-predictability, which reflect that the “CN
rule” is not the only law that governs the organization and evolution of a network.

It is worth remarking that, besides link prediction, our fast algorithm can also be
applied to other CN-based problems in very large real-world networks. Also,
self-predictability is discussed here in the context of the CN index, while it can be easily
generalized to evaluate the performance of other similarity indices in link prediction.

Appendix A. The pseudocodes of two traditional MapReduce-based
algorithms

The pseudocode of the pair generating algorithm [31] is shown in Table A1. The
pseudocode of the vectorization algorithm [37] is shown in Table A2.

map(u, adj_list[u]) {
    foreach x in adj_list[u]
        subarray = the nodes in adj_list[u] whose numbers are smaller than x
        emit(x, subarray) to the reduce process
}

reduce(x, [subarray1, subarray2, ...]) {
    foreach arrayItem in [subarray1, subarray2, ...]
        foreach y in arrayItem
            add one score to the pair (x, y)
}

Table A2. The vectorization algorithm.

Appendix B. Implementation of our algorithm based on RDD

Resilient distributed datasets (RDDs) are an efficient, general-purpose and
fault-tolerant abstraction for sharing data in cluster applications. RDDs can efficiently
express many cluster programming models including MapReduce, SQL, Pregel and so
on. Here, we also express our algorithm with RDDs, which can be further implemented
in Spark. On the other hand, in a shared-memory environment we can randomly access
the data by its address, which can further reduce the amount of data emitted from
the map function to the reduce function. The RDD description of our algorithm for
a shared-memory environment is shown in Table B1. With RDDs, aspects of the performance
of our algorithm such as the volume of communication traffic and space utilization can be
further optimized.

val filtered = adjGraph.filter(s => s._2.size > L)                 // step 1
  .flatMap(s => for (i <- s._2) yield (i, s._1)).groupByKey()
  .zipWithIndex()                                                   // step 2
val result = filtered.flatMap(s =>
    for (i <- 1 until s._1._2.size)
      yield (s._1._2(i), (s._2, i))).groupByKey()                   // step 3
  .filter(s => s._2.size > L).flatMap { s =>
    for (lnk <- s._2; i <- 0 until lnk._2)
      yield ((s._1, getByIndex(filtered, lnk._1)._1._2(i)), 1)
  }.reduceByKey(_ + _).filter(s => s._2 > L)                        // step 4

Table B1. The pseudocode of our algorithm with RDD.